AITopics | memory bottleneck

MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning

Neural Information Processing Systems | Apr-24-2026, 19:12:26 GMT

Tiny deep learning on microcontroller units (MCUs) is challenging due to the limited memory size. We find that the memory bottleneck is due to the imbalanced memory distribution in convolutional neural network (CNN) designs: the first several blocks have an order of magnitude larger memory usage than the rest of the network. To alleviate this issue, we propose a generic patch-by-patch inference scheduling, which operates only on a small spatial region of the feature map and significantly cuts down the peak memory. However, naive implementation brings overlapping patches and computation overhead. We further propose receptive field redistribution to shift the receptive field and FLOPs to the later stage and reduce the computation overhead. Manually redistributing the receptive field is difficult.
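The patch-by-patch scheduling described in this abstract can be illustrated with a minimal Python sketch. It is not the MCUNetV2 implementation: it uses 1-D unpadded convolutions on integer lists, and the helper names (`conv1d_valid`, `run_patched`, `peak_activation`) and the toy layer stack are assumptions made purely for illustration. It shows that computing the output in small patches reproduces the layer-by-layer result while holding much smaller buffers, at the cost of re-reading overlapping input regions.

```python
def conv1d_valid(x, w, stride=1):
    """1-D convolution (cross-correlation) without padding."""
    k = len(w)
    return [sum(x[i + j] * w[j] for j in range(k))
            for i in range(0, len(x) - k + 1, stride)]


def run_stack(x, layers):
    """Layer-by-layer inference: every full activation is materialised."""
    for w, s in layers:
        x = conv1d_valid(x, w, s)
    return x


def out_len(n, layers):
    """Output length of the stack for an input of length n."""
    for w, s in layers:
        n = (n - len(w)) // s + 1
    return n


def input_span(m, layers):
    """Input length needed to produce m consecutive outputs of the stack."""
    for w, s in reversed(layers):
        m = (m - 1) * s + len(w)
    return m


def peak_activation(n, layers):
    """Largest input+output buffer pair (in elements) over the stack."""
    peak = 0
    for w, s in layers:
        m = (n - len(w)) // s + 1
        peak = max(peak, n + m)
        n = m
    return peak


def run_patched(x, layers, patch_out):
    """Patch-by-patch inference: run the stack on one input slice at a time."""
    total_stride = 1
    for _, s in layers:
        total_stride *= s
    n_out = out_len(len(x), layers)
    result = []
    for start in range(0, n_out, patch_out):
        m = min(patch_out, n_out - start)       # last patch may be shorter
        lo = start * total_stride               # first input sample needed
        patch = x[lo:lo + input_span(m, layers)]
        result.extend(run_stack(patch, layers))
    return result


# Hypothetical toy network: (kernel weights, stride) per layer.
LAYERS = [([1, 2, 1], 2), ([1, 0, -1], 1), ([1, 1, 1], 2)]
signal = list(range(64))

full = run_stack(signal, LAYERS)
patched = run_patched(signal, LAYERS, patch_out=4)
print("same output:", patched == full)
print("peak buffer, full vs patched:",
      peak_activation(len(signal), LAYERS),
      peak_activation(input_span(4, LAYERS), LAYERS))
```

With these toy settings each 4-output patch reads 23 input samples while advancing by only 16, so neighbouring patches overlap by 7 samples. That overlap is the recomputation overhead the abstract mentions.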

artificial intelligence, deep learning, machine learning, (15 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Enhancing the Locality and Breaking the Memory Bottleneck of Transformer on Time Series Forecasting

Neural Information Processing Systems | Dec-25-2025, 12:06:43 GMT

Time series forecasting is an important problem across many domains, including predictions of solar plant energy output, electricity consumption, and traffic jam situation. In this paper, we propose to tackle such forecasting problem with Transformer. Although impressed by its performance in our preliminary study, we found its two major weaknesses: (1) locality-agnostics: the point-wise dot-product self-attention in canonical Transformer architecture is insensitive to local context, which can make the model prone to anomalies in time series; (2) memory bottleneck: space complexity of canonical Transformer grows quadratically with sequence length L, making directly modeling long time series infeasible. In order to solve these two issues, we first propose convolutional self-attention by producing queries and keys with causal convolution so that local context can be better incorporated into attention mechanism. Then, we propose LogSparse Transformer with only O(L(log L)^2) memory cost, improving forecasting accuracy for time series with fine granularity and strong long-term dependencies under constrained memory budget. Our experiments on both synthetic data and real-world datasets show that it compares favorably to the state-of-the-art.
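To make the memory argument concrete, here is a minimal Python sketch that counts attended (query, key) pairs per layer. The abstract does not spell out the sparsity pattern, so the exponential-step pattern below (each position attends to itself and to positions 1, 2, 4, 8, ... steps back) is an assumption, and `logsparse_indices` and `attention_pairs` are hypothetical helper names. The sketch covers a single layer only; the additional log factor in the O(L(log L)^2) figure is not modelled here.

```python
def logsparse_indices(i):
    """Positions attended by position i: itself plus i-1, i-2, i-4, ..."""
    idx = [i]
    step = 1
    while i - step >= 0:
        idx.append(i - step)
        step *= 2
    return idx


def attention_pairs(seq_len):
    """Count (query, key) pairs for the sparse and the dense causal pattern."""
    sparse = sum(len(logsparse_indices(i)) for i in range(seq_len))
    dense_causal = seq_len * (seq_len + 1) // 2
    return sparse, dense_causal


for seq_len in (64, 1024, 16384):
    sparse, dense = attention_pairs(seq_len)
    print(seq_len, sparse, dense)
```

Each position keeps at most about log2(L) + 2 keys under this assumed pattern, so the per-layer count grows roughly as L log L rather than quadratically.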

memory bottleneck, name change, transformer, (5 more...)

Industry: Energy > Power Industry (0.60)

Technology:

Information Technology > Modeling & Simulation (0.60)
Information Technology > Artificial Intelligence > Machine Learning (0.40)

Memory-efficient Patch-based Inference for Tiny Deep Learning

Neural Information Processing Systems | Dec-23-2025, 19:08:20 GMT

Tiny deep learning on microcontroller units (MCUs) is challenging due to the limited memory size. We find that the memory bottleneck is due to the imbalanced memory distribution in convolutional neural network (CNN) designs: the first several blocks have an order of magnitude larger memory usage than the rest of the network. To alleviate this issue, we propose a generic patch-by-patch inference scheduling, which operates only on a small spatial region of the feature map and significantly cuts down the peak memory. However, naive implementation brings overlapping patches and computation overhead. We further propose receptive field redistribution to shift the receptive field and FLOPs to the later stage and reduce the computation overhead. Manually redistributing the receptive field is difficult.

memory-efficient patch-based inference, name change, tiny deep learning, (8 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.62)

T-SAR: A Full-Stack Co-design for CPU-Only Ternary LLM Inference via In-Place SIMD ALU Reorganization

Oh, Hyunwoo, Nam, KyungIn, Bhattacharjya, Rajat, Chen, Hanning, Das, Tamoghno, Yun, Sanggeon, Jang, Suyeon, Ding, Andrew, Dutt, Nikil, Imani, Mohsen

arXiv.org Artificial Intelligence | Nov-18-2025

Recent advances in LLMs have outpaced the computational and memory capacities of edge platforms that primarily employ CPUs, thereby challenging efficient and scalable deployment. While ternary quantization enables significant resource savings, existing CPU solutions rely heavily on memory-based lookup tables (LUTs) which limit scalability, and FPGA or GPU accelerators remain impractical for edge use. This paper presents T-SAR, the first framework to achieve scalable ternary LLM inference on CPUs by repurposing the SIMD register file for dynamic, in-register LUT generation with minimal hardware modifications. T-SAR eliminates memory bottlenecks and maximizes data-level parallelism, delivering 5.6-24.5x and 1.1-86.2x improvements in GEMM latency and GEMV throughput, respectively, with only 3.2% power and 1.4% area overheads in SIMD units. T-SAR achieves up to 2.5-4.9x the energy efficiency of an NVIDIA Jetson AGX Orin, establishing a practical approach for efficient LLM inference on edge platforms.
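For readers unfamiliar with lookup-table (LUT) based ternary arithmetic, the following minimal Python sketch shows the general idea only. It is not T-SAR's in-register SIMD design, which the abstract does not detail; the group size, the base-3 encoding, and the function names (`build_lut`, `encode`, `gemv_lut`) are all assumptions for illustration. For each group of activations, every possible ternary partial sum is precomputed once, so each weight row needs only table lookups and additions.

```python
from itertools import product

TERNARY = (-1, 0, 1)


def build_lut(acts):
    """All 3**g partial sums for one group of g activations."""
    return [sum(w * a for w, a in zip(ws, acts))
            for ws in product(TERNARY, repeat=len(acts))]


def encode(ws):
    """Base-3 code of a ternary weight group, matching build_lut's order."""
    code = 0
    for w in ws:
        code = code * 3 + (w + 1)
    return code


def gemv_lut(weights, x, group=4):
    """Matrix-vector product with ternary weights via per-group lookups."""
    assert len(x) % group == 0, "sketch assumes len(x) is a multiple of group"
    starts = range(0, len(x), group)
    luts = [build_lut(x[g:g + group]) for g in starts]  # built once per input
    out = []
    for row in weights:
        # In a real system the weight codes would be precomputed offline.
        out.append(sum(luts[t][encode(row[g:g + group])]
                       for t, g in enumerate(starts)))
    return out


def gemv_direct(weights, x):
    """Reference implementation with explicit multiplications."""
    return [sum(w * a for w, a in zip(row, x)) for row in weights]


W = [[1, 0, -1, 1, 0, 0, 1, -1],
     [-1, -1, 0, 1, 1, 1, 1, 0]]
x = [3, -2, 5, 7, 1, 4, -6, 2]
print(gemv_lut(W, x), gemv_direct(W, x))
```

The table for each activation group is shared by all weight rows, which is where the savings come from when the matrix has many rows.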

arxiv preprint arxiv, large language model, machine learning, (18 more...)

2511.13676

Country:

Europe (1.00)
North America > United States > California (0.28)

Genre: Research Report (0.64)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

ZipVL: Efficient Large Vision-Language Models with Dynamic Token Sparsification

He, Yefei, Chen, Feng, Liu, Jing, Shao, Wenqi, Zhou, Hong, Zhang, Kaipeng, Zhuang, Bohan

arXiv.org Artificial Intelligence | Dec-18-2024

The efficiency of large vision-language models (LVLMs) is constrained by the computational bottleneck of the attention mechanism during the prefill phase and the memory bottleneck of fetching the key-value (KV) cache in the decoding phase, particularly in scenarios involving high-resolution images or videos. Visual content often exhibits substantial redundancy, resulting in highly sparse attention maps within LVLMs. This sparsity can be leveraged to accelerate attention computation or compress the KV cache through various approaches. However, most studies focus on addressing only one of these bottlenecks and do not adequately support dynamic adjustment of sparsity concerning distinct layers or tasks. In this paper, we present ZipVL, an efficient inference framework designed for LVLMs through a dynamic ratio allocation strategy of important tokens. This ratio is adaptively determined based on the layer-specific distribution of attention scores, rather than fixed hyper-parameters, thereby improving efficiency for less complex tasks while maintaining high performance for more challenging ones. Then we select important tokens based on their normalized attention scores and perform sparse attention mechanism solely on those important tokens, reducing the latency in the prefill phase. Tokens deemed less important will be discarded to reduce KV cache size, alleviating the memory bottleneck in the decoding phase. Our experiments demonstrate that ZipVL can accelerate the prefill phase by 2.3× and improve decoding throughput by 2.8×, with a minimal accuracy reduction of only 0.5% on VQAv2 benchmark over LLaVA-Next-13B model, effectively enhancing the generation efficiency of LVLMs.
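One way to picture a ratio that adapts to the attention distribution is a cumulative-score cutoff, sketched below in Python. This is an illustrative assumption rather than ZipVL's published procedure: the threshold `tau`, the function name `select_important_tokens`, and the example scores are all hypothetical. Tokens are kept in order of normalized score until their accumulated share reaches `tau`, so peaked distributions keep few tokens and flat ones keep many.

```python
def select_important_tokens(scores, tau=0.9):
    """Return sorted indices of the fewest tokens whose normalized
    attention scores sum to at least tau (assumed selection rule)."""
    total = sum(scores)
    norm = [s / total for s in scores]
    order = sorted(range(len(norm)), key=lambda i: norm[i], reverse=True)
    kept, acc = [], 0.0
    for i in order:
        kept.append(i)
        acc += norm[i]
        if acc >= tau:
            break
    return sorted(kept)


peaked = [8.0, 1.0, 0.5, 0.5]   # one dominant token
flat = [1.0, 1.0, 1.0, 1.0]     # attention spread evenly

for name, scores in (("peaked", peaked), ("flat", flat)):
    kept = select_important_tokens(scores, tau=0.75)
    print(name, kept, "ratio kept:", len(kept) / len(scores))
```

Under this rule the kept ratio is an output of the score distribution rather than a fixed hyper-parameter, which matches the behaviour the abstract describes at a high level.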

large language model, machine learning, natural language, (19 more...)

2410.08584

Country:

Asia > China > Shanghai > Shanghai (0.04)
South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
Oceania > Australia > South Australia > Adelaide (0.04)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.96)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Enhancing the Locality and Breaking the Memory Bottleneck of Transformer on Time Series Forecasting

Neural Information Processing Systems | Oct-10-2024, 05:05:32 GMT

Time series forecasting is an important problem across many domains, including predictions of solar plant energy output, electricity consumption, and traffic jam situation. In this paper, we propose to tackle such forecasting problem with Transformer. Although impressed by its performance in our preliminary study, we found its two major weaknesses: (1) locality-agnostics: the point-wise dot-product self-attention in canonical Transformer architecture is insensitive to local context, which can make the model prone to anomalies in time series; (2) memory bottleneck: space complexity of canonical Transformer grows quadratically with sequence length L, making directly modeling long time series infeasible. In order to solve these two issues, we first propose convolutional self-attention by producing queries and keys with causal convolution so that local context can be better incorporated into attention mechanism. Then, we propose LogSparse Transformer with only O(L(log L)^2) memory cost, improving forecasting accuracy for time series with fine granularity and strong long-term dependencies under constrained memory budget.

memory bottleneck, time series forecasting, transformer, (3 more...)

Industry: Energy > Power Industry (0.63)

Technology:

Information Technology > Data Science > Data Mining (0.65)
Information Technology > Modeling & Simulation (0.63)
Information Technology > Artificial Intelligence > Machine Learning (0.43)

Memory-efficient Patch-based Inference for Tiny Deep Learning

Neural Information Processing Systems | Oct-9-2024, 14:02:01 GMT

Tiny deep learning on microcontroller units (MCUs) is challenging due to the limited memory size. We find that the memory bottleneck is due to the imbalanced memory distribution in convolutional neural network (CNN) designs: the first several blocks have an order of magnitude larger memory usage than the rest of the network. To alleviate this issue, we propose a generic patch-by-patch inference scheduling, which operates only on a small spatial region of the feature map and significantly cuts down the peak memory. However, naive implementation brings overlapping patches and computation overhead. We further propose receptive field redistribution to shift the receptive field and FLOPs to the later stage and reduce the computation overhead.

computation overhead, memory-efficient patch-based inference, tiny deep learning, (5 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.97)

TinyML is bringing deep learning models to microcontrollers

#artificialintelligence | Jan-22-2022, 23:35:19 GMT

This article is part of our reviews of AI research papers, a series of posts that explore the latest findings in artificial intelligence. Deep learning models owe their initial success to large servers with large amounts of memory and clusters of GPUs. The promises of deep learning gave rise to an entire industry of cloud computing services for deep neural networks. Consequently, very large neural networks running on virtually unlimited cloud resources became very popular, especially among wealthy tech companies that can foot the bill. But at the same time, recent years have also seen a reverse trend, a concerted effort to create machine learning models for edge devices.

deep learning model, microcontroller, neural network, (14 more...)

Country: North America > United States > Massachusetts (0.05)

Industry: Information Technology (0.35)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Enhancing the Locality and Breaking the Memory Bottleneck of Transformer on Time Series Forecasting

Li, Shiyang, Jin, Xiaoyong, Xuan, Yao, Zhou, Xiyou, Chen, Wenhu, Wang, Yu-Xiang, Yan, Xifeng

Neural Information Processing Systems | Mar-18-2020, 22:33:16 GMT

Time series forecasting is an important problem across many domains, including predictions of solar plant energy output, electricity consumption, and traffic jam situation. In this paper, we propose to tackle such forecasting problem with Transformer. Although impressed by its performance in our preliminary study, we found its two major weaknesses: (1) locality-agnostics: the point-wise dot-product self-attention in canonical Transformer architecture is insensitive to local context, which can make the model prone to anomalies in time series; (2) memory bottleneck: space complexity of canonical Transformer grows quadratically with sequence length L, making directly modeling long time series infeasible. In order to solve these two issues, we first propose convolutional self-attention by producing queries and keys with causal convolution so that local context can be better incorporated into attention mechanism. Then, we propose LogSparse Transformer with only O(L(log L)^2) memory cost, improving forecasting accuracy for time series with fine granularity and strong long-term dependencies under constrained memory budget.

memory bottleneck, time series forecasting, transformer, (3 more...)

Industry: Energy > Power Industry (0.63)

Technology:

Information Technology > Data Science > Data Mining (0.65)
Information Technology > Modeling & Simulation (0.63)
Information Technology > Artificial Intelligence > Machine Learning (0.48)

3D Electronic Nose Demonstrates Advantages of Carbon Nanotubes

IEEE Spectrum Robotics | Sep-13-2017, 15:10:33 GMT

You'd think computers spend most of their time and energy doing, well, computation. But that's not the case: about 90 percent of a computer's execution time and electrical energy is spent transferring data between the processor and the memory banks, says Subhasish Mitra, a computer scientist at Stanford University. Even if Moore's law continued on indefinitely, computers would still be limited by this memory bottleneck. This week in the journal Nature, Mitra and collaborators describe a new computer architecture they say addresses this problem--and that Mitra believes will improve both the energy efficiency and speed of computers by a factor of 1000. The new 3D architecture is based on novel devices including 2 million carbon nanotube transistors and over 1 million resistive RAM cells, all built on top of a layer of silicon using existing fabrication methods and connected by densely packed metal wiring between the layers. As a demonstration, the team built an electronic nose that can sense and identify several common vapors including lemon juice, rubbing alcohol, vodka, wine, and beer.

artificial intelligence, interconnect, shulaker, (11 more...)

Industry:

Materials (0.70)
Semiconductors & Electronics (0.70)

Technology: Information Technology > Artificial Intelligence > Robots (0.76)
